資料分析3

林嶔 (Lin, Chin)

Lesson 7

第一節:廣義線性模式-1(1)

– 請至這裡下載本週的範例資料

dat = read.csv("Example data.csv", header = TRUE)
head(dat)
##       eGFR Disease Survival.time Death Diabetes Cancer      SBP      DBP
## 1 34.65379       1     0.4771037     0        0      1 121.2353 121.3079
## 2 37.21183       1     3.0704424     0        1      1 122.2000 122.6283
## 3 32.60074       1     0.2607117     1        0      0 118.9136 121.7621
## 4 29.68481       1            NA    NA        0      0 118.2212 112.7043
## 5 28.35726       0     0.1681673     1        0      0 116.7469 115.7705
## 6 33.95012       1     1.2238556     0        0      0 119.9936 116.3872
##   Education Income
## 1         2      0
## 2         2      0
## 3         0      0
## 4         1      0
## 5         0      0
## 6         1      0

– 最常見的Y分別為:

  1. 連續變項 - 假設誤差為常態分佈 (這就是線性迴歸)

  2. 二元變項 - 假設機率服從邏輯斯分布 (這就是邏輯斯迴歸)

第一節:廣義線性模式-1(2)

Formula1 = "eGFR ~ factor(Diabetes) + SBP + factor(Income)"
Result1 = glm(Formula1, family = "gaussian", data = dat)
Result1
## 
## Call:  glm(formula = Formula1, family = "gaussian", data = dat)
## 
## Coefficients:
##       (Intercept)  factor(Diabetes)1                SBP  
##         -44.68079           -0.03554            0.64674  
##   factor(Income)1    factor(Income)2  
##           0.32995            1.18211  
## 
## Degrees of Freedom: 896 Total (i.e. Null);  892 Residual
##   (103 observations deleted due to missingness)
## Null Deviance:       45610 
## Residual Deviance: 2669  AIC: 3536
Result2 = summary(Result1)
Result2
## 
## Call:
## glm(formula = Formula1, family = "gaussian", data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4186  -1.1374  -0.0075   1.0652   6.0488  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -44.680787   0.705263 -63.353  < 2e-16 ***
## factor(Diabetes)1  -0.035537   0.162866  -0.218    0.827    
## SBP                 0.646741   0.005857 110.423  < 2e-16 ***
## factor(Income)1     0.329949   0.193766   1.703    0.089 .  
## factor(Income)2     1.182107   0.186438   6.340 3.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.992399)
## 
##     Null deviance: 45608.2  on 896  degrees of freedom
## Residual deviance:  2669.2  on 892  degrees of freedom
##   (103 observations deleted due to missingness)
## AIC: 3535.7
## 
## Number of Fisher Scoring iterations: 2
dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
glm(dat[,"eGFR"] ~ ., family = "gaussian", data = dat[,c("Diabetes", "SBP", "Income")])
## 
## Call:  glm(formula = dat[, "eGFR"] ~ ., family = "gaussian", data = dat[, 
##     c("Diabetes", "SBP", "Income")])
## 
## Coefficients:
## (Intercept)    Diabetes1          SBP      Income1      Income2  
##   -44.68079     -0.03554      0.64674      0.32995      1.18211  
## 
## Degrees of Freedom: 896 Total (i.e. Null);  892 Residual
##   (103 observations deleted due to missingness)
## Null Deviance:       45610 
## Residual Deviance: 2669  AIC: 3536

第一節:廣義線性模式-1(3)

– 我們挑幾個比較重要的來講…

  1. Estimate: 這是斜率的估計值,一般來說我們檢定他是否等於0,若等於0則代表該因子不影響Y的變化。

  2. Std. Error: 這是斜率的估計值的標準誤,95% ci是用他來計算的。

  3. t value: 這是檢定Estimate是否顯著的檢定值,他等於Estimate/Std. Error。

  4. Pr(>|t|): 這是t value相對應的p value。

– 以上這些數值可以利用函數「ls()」找找看在哪裡…

COEF = Result2$coefficients
COEF
##                       Estimate  Std. Error     t value     Pr(>|t|)
## (Intercept)       -44.68078691 0.705262678 -63.3533977 0.000000e+00
## factor(Diabetes)1  -0.03553692 0.162865536  -0.2181979 8.273248e-01
## SBP                 0.64674071 0.005856955 110.4226792 0.000000e+00
## factor(Income)1     0.32994860 0.193765940   1.7028204 8.895005e-02
## factor(Income)2     1.18210745 0.186438265   6.3404766 3.634238e-10

第一節:廣義線性模式-1(4)

my.pvalue = pt(abs(COEF[,3]), df = Result2$df[2], lower.tail = FALSE) * 2
all.equal(my.pvalue, COEF[,4])
## [1] TRUE
qnorm(0.975)
## [1] 1.959964
pnorm(1.96)
## [1] 0.9750021
qchisq(0.975, df = 1)
## [1] 5.023886
pchisq(3.84, df = 1)
## [1] 0.9499565

第一節:廣義線性模式-1(5)

m = COEF[,1]
s = COEF[,2]

#計算下界
ci.low = m - qt(0.975, df = Result2$df[2]) * s
ci.low
##       (Intercept) factor(Diabetes)1               SBP   factor(Income)1 
##      -46.06495450       -0.35518122        0.63524569       -0.05034167 
##   factor(Income)2 
##        0.81619867
#計算上界
ci.up = m + qt(0.975, df = Result2$df[2]) * s
ci.up
##       (Intercept) factor(Diabetes)1               SBP   factor(Income)1 
##       -43.2966193         0.2841074         0.6582357         0.7102389 
##   factor(Income)2 
##         1.5480162
m.3 = formatC(m, 3, format = "f")
ci.low.3 = formatC(ci.low, 3, format = "f")
ci.up.3 = formatC(ci.up, 3, format = "f")
paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")")
## [1] "-44.681 (-46.065 to -43.297)" "-0.036 (-0.355 to 0.284)"    
## [3] "0.647 (0.635 to 0.658)"       "0.330 (-0.050 to 0.710)"     
## [5] "1.182 (0.816 to 1.548)"

第一節:廣義線性模式-1(6)

Result3 = glm(dat[,"Disease"] ~ ., family = "binomial", data = dat[,c("Diabetes", "SBP", "Income")])
Result3
## 
## Call:  glm(formula = dat[, "Disease"] ~ ., family = "binomial", data = dat[, 
##     c("Diabetes", "SBP", "Income")])
## 
## Coefficients:
## (Intercept)    Diabetes1          SBP      Income1      Income2  
##   -0.623638     0.353512     0.013074    -0.007878     0.057746  
## 
## Degrees of Freedom: 876 Total (i.e. Null);  872 Residual
##   (123 observations deleted due to missingness)
## Null Deviance:       1013 
## Residual Deviance: 1005  AIC: 1015
Result4 = summary(Result3)
Result4
## 
## Call:
## glm(formula = dat[, "Disease"] ~ ., family = "binomial", data = dat[, 
##     c("Diabetes", "SBP", "Income")])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9365  -1.5102   0.7647   0.8141   0.9526  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -0.623638   0.936715  -0.666   0.5056  
## Diabetes1    0.353512   0.232143   1.523   0.1278  
## SBP          0.013074   0.007823   1.671   0.0947 .
## Income1     -0.007878   0.259658  -0.030   0.9758  
## Income2      0.057746   0.257617   0.224   0.8226  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1013.4  on 876  degrees of freedom
## Residual deviance: 1005.0  on 872  degrees of freedom
##   (123 observations deleted due to missingness)
## AIC: 1015
## 
## Number of Fisher Scoring iterations: 4

第一節:廣義線性模式-1(7)

  1. 線性迴歸是使用t value進行檢定,而邏輯斯迴歸是使用z value進行檢定

  2. 邏輯斯迴歸比較不常使用『斜率』以及其95% CI作結果的呈現,而較常呈現『勝算比(OR, Odds ratio)』以及其95% CI

COEF = Result4$coefficients

#計算p value(使用函數「pnorm()」)
my.pvalue = pnorm(abs(COEF[,3]), lower.tail = FALSE) * 2
all.equal(my.pvalue, COEF[,4])
## [1] TRUE
#計算95% CI
m = COEF[,1]
s = COEF[,2]
ci.low = m - qnorm(0.975) * s
ci.up = m + qnorm(0.975) * s

#轉變為OR (這是唯一增加的部分)
m.OR = exp(m)
ci.low.OR = exp(ci.low)
ci.up.OR = exp(ci.up)

#標準化輸出
m.OR.3 = formatC(m.OR, 3, format = "f")
ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")")
## [1] "0.536 (0.085 to 3.361)" "1.424 (0.903 to 2.245)"
## [3] "1.013 (0.998 to 1.029)" "0.992 (0.596 to 1.650)"
## [5] "1.059 (0.639 to 1.755)"

練習1:展示Paper上的資訊

  1. 使用者能隨意輸入1個依變項 & 數個自變項進行運算

2_1. 使用者輸入的依變項若為連續變項,則進行線性回歸運算

2_2. 使用者輸入的依變項若為類別變項,則先確認是否僅有0與1兩個類別,確認成功才進行運算

  1. 運算結束後,計算點估計及95% CI

  2. 將運算出來的95% CI擴充在原來的係數矩陣右邊,呈現類似這樣的狀況:

##             Estimate Std. Error z value  Pr(>|z|) OR (95% CI)             
## (Intercept) "-0.624" "0.937"    "-0.666" "0.506"  "0.536 (0.085 to 3.361)"
## Diabetes1   "0.354"  "0.232"    "1.523"  "0.128"  "1.424 (0.903 to 2.245)"
## SBP         "0.013"  "0.008"    "1.671"  "0.095"  "1.013 (0.998 to 1.029)"
## Income1     "-0.008" "0.260"    "-0.030" "0.976"  "0.992 (0.596 to 1.650)"
## Income2     "0.058"  "0.258"    "0.224"  "0.823"  "1.059 (0.639 to 1.755)"
##                   Estimate  Std. Error t value   Pr(>|t|)
## (Intercept)       "-44.681" "0.705"    "-63.353" "<0.001"
## factor(Diabetes)1 "-0.036"  "0.163"    "-0.218"  "0.827" 
## SBP               "0.647"   "0.006"    "110.423" "<0.001"
## factor(Income)1   "0.330"   "0.194"    "1.703"   "0.089" 
## factor(Income)2   "1.182"   "0.186"    "6.340"   "<0.001"
##                   Slope (95% CI)                
## (Intercept)       "-44.681 (-46.065 to -43.297)"
## factor(Diabetes)1 "-0.036 (-0.355 to 0.284)"    
## SBP               "0.647 (0.635 to 0.658)"      
## factor(Income)1   "0.330 (-0.050 to 0.710)"     
## factor(Income)2   "1.182 (0.816 to 1.548)"

練習1答案(1)

my_func.glm = function (y, x_data) {
  
  lvl.y = levels(factor(y))
  n.lvl.y = length(lvl.y)  
  
  if (n.lvl.y < 10) {
    
    if (mean(lvl.y %in% 0:1) == 1) {
      
      Result = glm(y ~ ., family = "binomial", data = x_data)
      summary_Result = summary(Result)
      COEF = summary_Result$coefficients
      
      m = COEF[,1]
      s = COEF[,2]
      
      #計算下界
      ci.low = m - qnorm(0.975) * s
      
      #計算上界
      ci.up = m + qnorm(0.975) * s
      
      #轉變為OR (這是唯一增加的部分)
      m.OR = exp(m)
      ci.low.OR = exp(ci.low)
      ci.up.OR = exp(ci.up)
      
      #標準化輸出
      m.OR.3 = formatC(m.OR, 3, format = "f")
      ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
      ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
      
      COEF = cbind(COEF, paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")"))
      
      colnames(COEF)[5] = "OR (95% CI)"
      
      COEF
      
    } else {
      
      cat("y為類別變項時僅能接受0與1的類別變項\n")
      
    }
    
  } else {
    
    Result = glm(y ~ ., family = "gaussian", data = x_data)
    summary_Result = summary(Result)
    COEF = summary_Result$coefficients

    m = COEF[,1]
    s = COEF[,2]
    
    #計算下界
    ci.low = m - qt(0.975, df = summary_Result$df[2]) * s
    
    #計算上界
    ci.up = m + qt(0.975, df = summary_Result$df[2]) * s
    
    m.3 = formatC(m, 3, format = "f")
    ci.low.3 = formatC(ci.low, 3, format = "f")
    ci.up.3 = formatC(ci.up, 3, format = "f")
    COEF = cbind(COEF, paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")"))
    
    colnames(COEF)[5] = "Slope (95% CI)"
    
    COEF
    
  }
  
}

練習1答案(2)

dat[,"Income"] = as.factor(dat[,"Income"])

my_func.glm(y = dat$Education, x_data = dat[,c("SBP", "Income")]) 
## y為類別變項時僅能接受0與1的類別變項
my_func.glm(y = dat$Disease, x_data = dat[,c("SBP", "Income")]) 
##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.87057456391603"   
## SBP         "0.018343099853456"    "0.00717946299007011"
## Income1     "-0.00331543024980331" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## SBP         "2.5549403735107"     "0.0106205989530202"
## Income1     "-0.0130361829036361" "0.989598925526899" 
## Income2     "0.431975941993863"   "0.665758898626829" 
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## SBP         "1.019 (1.004 to 1.033)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"
my_func.glm(y = dat$eGFR, x_data = dat[,c("SBP", "Income")]) 
##             Estimate            Std. Error            t value            
## (Intercept) "-44.5446073390111" "0.650718644728415"   "-68.4544813643726"
## SBP         "0.645468680757481" "0.00532053317500706" "121.316540941712" 
## Income1     "0.320945790901737" "0.189850880461012"   "1.69051515653964" 
## Income2     "1.18312335438857"  "0.183056090544938"   "6.46317394229569" 
##             Pr(>|t|)               Slope (95% CI)                
## (Intercept) "0"                    "-44.545 (-45.822 to -43.268)"
## SBP         "0"                    "0.645 (0.635 to 0.656)"      
## Income1     "0.0912703998742093"   "0.321 (-0.052 to 0.694)"     
## Income2     "1.66581914375454e-10" "1.183 (0.824 to 1.542)"

第二節:廣義線性模式-2(1)

##                   Estimate  Std. Error t value   Pr(>|t|)
## (Intercept)       "-44.681" "0.705"    "-63.353" "<0.001"
## factor(Diabetes)1 "-0.036"  "0.163"    "-0.218"  "0.827" 
## SBP               "0.647"   "0.006"    "110.423" "<0.001"
## factor(Income)1   "0.330"   "0.194"    "1.703"   "0.089" 
## factor(Income)2   "1.182"   "0.186"    "6.340"   "<0.001"
##                   Slope (95% CI)                
## (Intercept)       "-44.681 (-46.065 to -43.297)"
## factor(Diabetes)1 "-0.036 (-0.355 to 0.284)"    
## SBP               "0.647 (0.635 to 0.658)"      
## factor(Income)1   "0.330 (-0.050 to 0.710)"     
## factor(Income)2   "1.182 (0.816 to 1.548)"
dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
Result1 = glm(dat[,"eGFR"] ~ ., family = "gaussian", data = dat[,c("Diabetes", "SBP", "Income")])
Result2 = summary(Result1)

COEF = Result2$coefficients
COEF
##                 Estimate  Std. Error     t value     Pr(>|t|)
## (Intercept) -44.68078691 0.705262678 -63.3533977 0.000000e+00
## Diabetes1    -0.03553692 0.162865536  -0.2181979 8.273248e-01
## SBP           0.64674071 0.005856955 110.4226792 0.000000e+00
## Income1       0.32994860 0.193765940   1.7028204 8.895005e-02
## Income2       1.18210745 0.186438265   6.3404766 3.634238e-10
VCOV = Result2$cov.scaled
VCOV
##              (Intercept)     Diabetes1           SBP      Income1
## (Intercept)  0.497395445  0.0402153065 -4.110204e-03 -0.008966506
## Diabetes1    0.040215306  0.0265251829 -3.688336e-04 -0.001223519
## SBP         -0.004110204 -0.0003688336  3.430393e-05  0.000040765
## Income1     -0.008966506 -0.0012235192  4.076500e-05  0.037545239
## Income2     -0.005350712 -0.0002702795  9.676645e-06  0.004237876
##                   Income2
## (Intercept) -5.350712e-03
## Diabetes1   -2.702795e-04
## SBP          9.676645e-06
## Income1      4.237876e-03
## Income2      3.475923e-02
#將第4項與第5項進行整合
variables = 4:5
DF = length(variables)
CHISQ = t(COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% COEF[variables,1]
PVALUE = pchisq(CHISQ, df = DF, lower.tail = FALSE)
PVALUE
##              [,1]
## [1,] 1.169025e-09
#僅檢定第4項 (結果近似於使用t value進行檢定)
variables = 4
DF = length(variables)
CHISQ = t(COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% COEF[variables,1]
PVALUE = pchisq(CHISQ, df = DF, lower.tail = FALSE)
PVALUE
##            [,1]
## [1,] 0.08860168
COEF[4,4]
## [1] 0.08895005

練習2:Wald test

## $SUMMARY
##          Global p value
## Diabetes "0.827"       
## SBP      "<0.001"      
## Income   "<0.001"      
## 
## $COEF
##             Estimate  Std. Error t value   Pr(>|t|)
## (Intercept) "-44.681" "0.705"    "-63.353" "<0.001"
## Diabetes1   "-0.036"  "0.163"    "-0.218"  "0.827" 
## SBP         "0.647"   "0.006"    "110.423" "<0.001"
## Income1     "0.330"   "0.194"    "1.703"   "0.089" 
## Income2     "1.182"   "0.186"    "6.340"   "<0.001"
##             Slope (95% CI)                
## (Intercept) "-44.681 (-46.065 to -43.297)"
## Diabetes1   "-0.036 (-0.355 to 0.284)"    
## SBP         "0.647 (0.635 to 0.658)"      
## Income1     "0.330 (-0.050 to 0.710)"     
## Income2     "1.182 (0.816 to 1.548)"

練習2答案(1)

my_func.glm = function (y, x_data) {
  
  lvl.y = levels(factor(y))
  n.lvl.y = length(lvl.y)  
  
  if (n.lvl.y < 10) {
    
    if (mean(lvl.y %in% 0:1) == 1) {
      
      Result = glm(y ~ ., family = "binomial", data = x_data)
      summary_Result = summary(Result)
      COEF = summary_Result$coefficients
      
      m = COEF[,1]
      s = COEF[,2]
      
      #計算下界
      ci.low = m - qnorm(0.975) * s
      
      #計算上界
      ci.up = m + qnorm(0.975) * s
      
      #轉變為OR (這是唯一增加的部分)
      m.OR = exp(m)
      ci.low.OR = exp(ci.low)
      ci.up.OR = exp(ci.up)
      
      #標準化輸出
      m.OR.3 = formatC(m.OR, 3, format = "f")
      ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
      ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
      
      COEF = cbind(COEF, paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")"))
      
      colnames(COEF)[5] = "OR (95% CI)"
      
    } else {
      
      cat("y為類別變項時僅能接受0與1的類別變項\n")
      
    }
    
  } else {
    
    Result = glm(y ~ ., family = "gaussian", data = x_data)
    summary_Result = summary(Result)
    COEF = summary_Result$coefficients

    m = COEF[,1]
    s = COEF[,2]
    
    #計算下界
    ci.low = m - qt(0.975, df = summary_Result$df[2]) * s
    
    #計算上界
    ci.up = m + qt(0.975, df = summary_Result$df[2]) * s
    
    m.3 = formatC(m, 3, format = "f")
    ci.low.3 = formatC(ci.low, 3, format = "f")
    ci.up.3 = formatC(ci.up, 3, format = "f")
    COEF = cbind(COEF, paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")"))
    
    colnames(COEF)[5] = "Slope (95% CI)"
    
  }
  
  summary_result = matrix(0, ncol = 1, nrow = ncol(x_data))
  rownames(summary_result) = colnames(x_data)
  rownames(summary_result) = colnames(x_data)
  
  num_COEF = summary_Result$coefficients
  VCOV = summary_Result$cov.scaled
  start_val = 2
  
  for (i in 1:ncol(x_data)) {
    if (class(x_data[,i]) == 'factor') {
      end_val = start_val + length(levels(x_data[,i])) - 2
    } else {
      end_val = start_val 
    }
    variables = start_val:end_val
    DF = length(variables)
    CHISQ = t(num_COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% num_COEF[variables,1]
    summary_result[i,1] = pchisq(CHISQ, df = DF, lower.tail = FALSE)
    start_val = end_val + 1
  }
  
  MY_LIST = list(summary_result, COEF)
  MY_LIST
  
}

練習2答案(2)

dat[,"Income"] = as.factor(dat[,"Income"])

my_func.glm(y = dat$Disease, x_data = dat[,c("SBP", "Income")]) 
## [[1]]
##             [,1]
## SBP    0.0106206
## Income 0.9092092
## 
## [[2]]
##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.87057456391603"   
## SBP         "0.018343099853456"    "0.00717946299007011"
## Income1     "-0.00331543024980331" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## SBP         "2.5549403735107"     "0.0106205989530202"
## Income1     "-0.0130361829036361" "0.989598925526899" 
## Income2     "0.431975941993863"   "0.665758898626829" 
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## SBP         "1.019 (1.004 to 1.033)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"
my_func.glm(y = dat$Disease, x_data = dat[,c("Income", "SBP")]) 
## [[1]]
##             [,1]
## Income 0.9092092
## SBP    0.0106206
## 
## [[2]]
##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.870574563916031"  
## Income1     "-0.00331543024980338" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
## SBP         "0.018343099853456"    "0.00717946299007011"
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## Income1     "-0.0130361829036364" "0.989598925526899" 
## Income2     "0.431975941993862"   "0.66575889862683"  
## SBP         "2.5549403735107"     "0.0106205989530202"
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"
## SBP         "1.019 (1.004 to 1.033)"

第三節:Cox比例風險模型(1)

– 目前套件『survival』是最常用來做存活分析的套件,他同時支援各式存活分析的相關功能。

install.packages("survival")
library("survival")

第三節:Cox比例風險模型(2)

dat[,"Diabetes"] = as.factor(dat[,"Diabetes"])
dat[,"Income"] = as.factor(dat[,"Income"])
Result5 = coxph(Surv(dat[,"Survival.time"], dat[,"Death"]) ~ ., data = dat[,c("Diabetes", "SBP", "Income")])
Result5
## Call:
## coxph(formula = Surv(dat[, "Survival.time"], dat[, "Death"]) ~ 
##     ., data = dat[, c("Diabetes", "SBP", "Income")])
## 
##                coef exp(coef)  se(coef)      z     p
## Diabetes1 -0.013309  0.986779  0.128309 -0.104 0.917
## SBP       -0.006637  0.993385  0.004487 -1.479 0.139
## Income1   -0.255620  0.774436  0.164798 -1.551 0.121
## Income2    0.107833  1.113862  0.147528  0.731 0.465
## 
## Likelihood ratio test=5.95  on 4 df, p=0.203
## n= 883, number of events= 448 
##    (117 observations deleted due to missingness)
Result6 = summary(Result5)
Result6
## Call:
## coxph(formula = Surv(dat[, "Survival.time"], dat[, "Death"]) ~ 
##     ., data = dat[, c("Diabetes", "SBP", "Income")])
## 
##   n= 883, number of events= 448 
##    (117 observations deleted due to missingness)
## 
##                coef exp(coef)  se(coef)      z Pr(>|z|)
## Diabetes1 -0.013309  0.986779  0.128309 -0.104    0.917
## SBP       -0.006637  0.993385  0.004487 -1.479    0.139
## Income1   -0.255620  0.774436  0.164798 -1.551    0.121
## Income2    0.107833  1.113862  0.147528  0.731    0.465
## 
##           exp(coef) exp(-coef) lower .95 upper .95
## Diabetes1    0.9868     1.0134    0.7674     1.269
## SBP          0.9934     1.0067    0.9847     1.002
## Income1      0.7744     1.2913    0.5607     1.070
## Income2      1.1139     0.8978    0.8342     1.487
## 
## Concordance= 0.535  (se = 0.016 )
## Rsquare= 0.007   (max possible= 0.997 )
## Likelihood ratio test= 5.95  on 4 df,   p=0.2
## Wald test            = 5.79  on 4 df,   p=0.2
## Score (logrank) test = 5.81  on 4 df,   p=0.2

第三節:Cox比例風險模型(3)

– 計算HR的方法與OR類似…

COEF = Result6$coefficients

#計算p value(使用函數「pnorm()」)
my.pvalue = pnorm(abs(COEF[,4]), lower.tail = FALSE) * 2
all.equal(my.pvalue, COEF[,5])
## [1] TRUE
#計算95% CI
m = COEF[,1]
s = COEF[,3]
ci.low = m - qnorm(0.975) * s
ci.up = m + qnorm(0.975) * s

#轉變為HR (與OR完全一樣)
m.HR = exp(m)
ci.low.HR = exp(ci.low)
ci.up.HR = exp(ci.up)

#標準化輸出
m.HR.3 = formatC(m.HR, 3, format = "f")
ci.low.HR.3 = formatC(ci.low.HR, 3, format = "f")
ci.up.HR.3 = formatC(ci.up.HR, 3, format = "f")
paste0(m.HR.3, " (", ci.low.HR.3, " to ", ci.up.HR.3, ")")
## [1] "0.987 (0.767 to 1.269)" "0.993 (0.985 to 1.002)"
## [3] "0.774 (0.561 to 1.070)" "1.114 (0.834 to 1.487)"

第三節:Cox比例風險模型(4)

COEF = Result6$coefficients
COEF
##                   coef exp(coef)   se(coef)          z  Pr(>|z|)
## Diabetes1 -0.013308764 0.9867794 0.12830893 -0.1037244 0.9173881
## SBP       -0.006637461 0.9933845 0.00448735 -1.4791493 0.1391004
## Income1   -0.255619728 0.7744364 0.16479768 -1.5511124 0.1208747
## Income2    0.107832965 1.1138617 0.14752797  0.7309323 0.4648205
VCOV = Result5$var
VCOV
##               [,1]          [,2]          [,3]          [,4]
## [1,]  0.0164631803 -1.936315e-04 -8.050264e-04  4.645830e-04
## [2,] -0.0001936315  2.013631e-05  9.961013e-06 -2.069759e-05
## [3,] -0.0008050264  9.961013e-06  2.715827e-02  2.771585e-03
## [4,]  0.0004645830 -2.069759e-05  2.771585e-03  2.176450e-02
#將第3項與第4項進行整合
variables = 3:4
DF = length(variables)
CHISQ = t(COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% COEF[variables,1]
PVALUE = pchisq(CHISQ, df = DF, lower.tail = FALSE)
PVALUE
##           [,1]
## [1,] 0.1978171

– 注意:Cox比例風險模型不存在截距項,這點要特別注意喔!

練習3:擴充至存活分析

## $SUMMARY
##          Global p value
## Diabetes "0.917"       
## SBP      "0.139"       
## Income   "0.198"       
## 
## $COEF
##           coef     se(coef) z        Pr(>|z|) HR (95% CI)             
## Diabetes1 "-0.013" "0.128"  "-0.104" "0.917"  "0.987 (0.767 to 1.269)"
## SBP       "-0.007" "0.004"  "-1.479" "0.139"  "0.993 (0.985 to 1.002)"
## Income1   "-0.256" "0.165"  "-1.551" "0.121"  "0.774 (0.561 to 1.070)"
## Income2   "0.108"  "0.148"  "0.731"  "0.465"  "1.114 (0.834 to 1.487)"

練習3答案(1)

library(survival)

my_func = function (y, time = NULL, x_data) {
  
  lvl.y = levels(factor(y))
  n.lvl.y = length(lvl.y)  
  
  if (is.null(time)) {
    
    if (n.lvl.y < 10) {
      
      if (mean(lvl.y %in% 0:1) == 1) {
        
        Result = glm(y ~ ., family = "binomial", data = x_data)
        summary_Result = summary(Result)
        COEF = summary_Result$coefficients
        
        m = COEF[,1]
        s = COEF[,2]
        
        #計算下界
        ci.low = m - qnorm(0.975) * s
        
        #計算上界
        ci.up = m + qnorm(0.975) * s
        
        #轉變為OR (這是唯一增加的部分)
        m.OR = exp(m)
        ci.low.OR = exp(ci.low)
        ci.up.OR = exp(ci.up)
        
        #標準化輸出
        m.OR.3 = formatC(m.OR, 3, format = "f")
        ci.low.OR.3 = formatC(ci.low.OR, 3, format = "f")
        ci.up.OR.3 = formatC(ci.up.OR, 3, format = "f")
        
        COEF = cbind(COEF, paste0(m.OR.3, " (", ci.low.OR.3, " to ", ci.up.OR.3, ")"))
        
        colnames(COEF)[5] = "OR (95% CI)"
        
      } else {
        
        cat("y為類別變項時僅能接受0與1的類別變項\n")
        
      }
      
    } else {
      
      Result = glm(y ~ ., family = "gaussian", data = x_data)
      summary_Result = summary(Result)
      COEF = summary_Result$coefficients
      
      m = COEF[,1]
      s = COEF[,2]
      
      #計算下界
      ci.low = m - qt(0.975, df = summary_Result$df[2]) * s
      
      #計算上界
      ci.up = m + qt(0.975, df = summary_Result$df[2]) * s
      
      m.3 = formatC(m, 3, format = "f")
      ci.low.3 = formatC(ci.low, 3, format = "f")
      ci.up.3 = formatC(ci.up, 3, format = "f")
      COEF = cbind(COEF, paste0(m.3, " (", ci.low.3, " to ", ci.up.3, ")"))
      
      colnames(COEF)[5] = "Slope (95% CI)"
      
    }
    
  } else {
    
    Result = coxph(Surv(time, y) ~ ., data = x_data)
    summary_Result = summary(Result)
    
    COEF = summary_Result$coefficients
    
    m = COEF[,1]
    s = COEF[,3]
    
    #計算下界
    ci.low = m - qnorm(0.975) * s
    
    #計算上界
    ci.up = m + qnorm(0.975) * s
    
    #轉變為HR (這是唯一增加的部分)
    m.HR = exp(m)
    ci.low.HR = exp(ci.low)
    ci.up.HR = exp(ci.up)
    
    #標準化輸出
    m.HR.3 = formatC(m.HR, 3, format = "f")
    ci.low.HR.3 = formatC(ci.low.HR, 3, format = "f")
    ci.up.HR.3 = formatC(ci.up.HR, 3, format = "f")
    
    COEF = cbind(COEF, paste0(m.HR.3, " (", ci.low.HR.3, " to ", ci.up.HR.3, ")"))
    
    colnames(COEF)[6] = "HR (95% CI)"
    
  }
  
  summary_result = matrix(0, ncol = 1, nrow = ncol(x_data))
  rownames(summary_result) = colnames(x_data)
  rownames(summary_result) = colnames(x_data)
  
  num_COEF = summary_Result$coefficients
  if (is.null(time)) {
    VCOV = summary_Result$cov.scaled
    start_val = 2
  } else {
    VCOV = Result$var
    start_val = 1
  }
  
  for (i in 1:ncol(x_data)) {
    if (class(x_data[,i]) == 'factor') {
      end_val = start_val + length(levels(x_data[,i])) - 2
    } else {
      end_val = start_val 
    }
    variables = start_val:end_val
    DF = length(variables)
    CHISQ = t(num_COEF[variables,1]) %*% solve(VCOV[variables,variables]) %*% num_COEF[variables,1]
    summary_result[i,1] = pchisq(CHISQ, df = DF, lower.tail = FALSE)
    start_val = end_val + 1
  }
  
  MY_LIST = list(summary_result, COEF)
  MY_LIST
  
}

練習3答案(2)

dat[,"Income"] = as.factor(dat[,"Income"])

my_func.glm(y = dat$Disease, x_data = dat[,c("SBP", "Income")]) 
## [[1]]
##             [,1]
## SBP    0.0106206
## Income 0.9092092
## 
## [[2]]
##             Estimate               Std. Error           
## (Intercept) "-1.21299612535807"    "0.87057456391603"   
## SBP         "0.018343099853456"    "0.00717946299007011"
## Income1     "-0.00331543024980331" "0.254325232647553"  
## Income2     "0.11049049150071"     "0.25577927092588"   
##             z value               Pr(>|z|)            
## (Intercept) "-1.39332824049183"   "0.163520554560428" 
## SBP         "2.5549403735107"     "0.0106205989530202"
## Income1     "-0.0130361829036361" "0.989598925526899" 
## Income2     "0.431975941993863"   "0.665758898626829" 
##             OR (95% CI)             
## (Intercept) "0.297 (0.054 to 1.638)"
## SBP         "1.019 (1.004 to 1.033)"
## Income1     "0.997 (0.605 to 1.641)"
## Income2     "1.117 (0.676 to 1.844)"
my_func(y = dat$Death, time = dat$Survival.time, x_data = dat[,c("Income", "SBP")])
## [[1]]
##             [,1]
## Income 0.2242644
## SBP    0.1165677
## 
## [[2]]
##         coef                   exp(coef)           se(coef)             
## Income1 "-0.213973925878094"   "0.807369436197337" "0.160557604810633"  
## Income2 "0.135928247771198"    "1.14559969131463"  "0.14487635651984"   
## SBP     "-0.00656382892311547" "0.993457665946688" "0.00418252926047729"
##         z                   Pr(>|z|)            HR (95% CI)             
## Income1 "-1.33269256308639" "0.182632714802121" "0.807 (0.589 to 1.106)"
## Income2 "0.938236238378785" "0.348123019048662" "1.146 (0.862 to 1.522)"
## SBP     "-1.569344412038"   "0.116567709083106" "0.993 (0.985 to 1.002)"
my_func(y = dat$Death, time = dat$Survival.time, x_data = dat[,c("SBP", "Income")])
## [[1]]
##             [,1]
## SBP    0.1165677
## Income 0.2242644
## 
## [[2]]
##         coef                   exp(coef)           se(coef)             
## SBP     "-0.00656382892311547" "0.993457665946688" "0.00418252926047729"
## Income1 "-0.213973925878094"   "0.807369436197337" "0.160557604810633"  
## Income2 "0.135928247771198"    "1.14559969131463"  "0.14487635651984"   
##         z                   Pr(>|z|)            HR (95% CI)             
## SBP     "-1.569344412038"   "0.116567709083106" "0.993 (0.985 to 1.002)"
## Income1 "-1.33269256308639" "0.182632714802121" "0.807 (0.589 to 1.106)"
## Income2 "0.938236238378785" "0.348123019048662" "1.146 (0.862 to 1.522)"

家庭/期中作業

  1. 選擇線性迴歸、邏輯斯迴歸或Cox比例風險模型

  2. 輸入正確的依變項

  3. 輸入有哪些變項想要進行檢定(自變項集合)

  4. 輸入需要共同調整的變項(調整變項集合)

– 注意,假設SBP這個變項同時出現在自變項集合中及調整變項集合中,則在調整時不需要再調整它一次

##    Independent variable Crude-HR (95% CI)     p-value 
## 1  "Diabetes"           ""                    "0.321" 
## 2  "Diabetes:0"         "1.00"                ""      
## 3  "Diabetes:1"         "0.89 (0.71 to 1.12)" "0.321" 
## 4  "SBP"                "0.99 (0.99 to 1.00)" "0.150" 
## 5  "Income"             ""                    "0.212" 
## 6  "Income:0"           "1.00"                ""      
## 7  "Income:1"           "0.81 (0.60 to 1.11)" "0.189" 
## 8  "Income:2"           "1.16 (0.87 to 1.53)" "0.310" 
## 9  "Education"          ""                    "<0.001"
## 10 "Education:0"        "1.00"                ""      
## 11 "Education:1"        "0.75 (0.61 to 0.93)" "0.007" 
## 12 "Education:2"        "0.39 (0.30 to 0.50)" "<0.001"
##    Adj-HR (95% CI)#      p-value 
## 1  ""                    "0.785" 
## 2  "1.00"                ""      
## 3  "0.97 (0.75 to 1.24)" "0.785" 
## 4  "0.99 (0.99 to 1.00)" "0.191" 
## 5  ""                    "0.198" 
## 6  "1.00"                ""      
## 7  "0.77 (0.56 to 1.07)" "0.121" 
## 8  "1.11 (0.83 to 1.49)" "0.465" 
## 9  ""                    "<0.001"
## 10 "1.00"                ""      
## 11 "0.72 (0.58 to 0.89)" "0.003" 
## 12 "0.38 (0.29 to 0.50)" "<0.001"

資料分析總結

– 結合之前的兩節課,大部分期刊中常見的統計方法都已經提到了,儘管你可能不知道是怎麼算的,但一定要熟悉結果的解釋。

– 你應該已經學會這些東西了:

  1. 描述性統計

  2. 單變項統計(類別-類別、類別-連續)

  3. 相關性(類別-類別、類別-連續、連續-連續)

  4. 多變項統計(GLM、Cox model)